May 23, 2017

Reproducibility: who cares?

Science retracts gay marriage paper without agreement of lead author LaCour

  • In May 2015 Science retracted a study of how canvassers can sway people's opinions about gay marriage published just 5 months earlier.

  • Science Editor-in-Chief Marcia McNutt: Original survey data not made available for independent reproduction of results. + Survey incentives misrepresented. + Sponsorship statement false.

  • Two Berkeley grad students who attempted to replicate the study quickly discovered that the data must have been faked.

  • Methods we'll discuss today can't prevent this, but they can make it easier to discover issues.

Source: http://news.sciencemag.org/policy/2015/05/science-retracts-gay-marriage-paper-without-lead-author-s-consent

Bad spreadsheet merge kills depression paper, quick fix resurrects it

Divorce study felled by a coding error gets a second chance

Divorce study retraction: Editor's note

  • "The research environment is fast-paced given the ethos to “publish or perish”."

  • "[…] research is becoming increasingly complex, with greater calls for transdisciplinary collaborations, “big data,” and more sophisticated research questions and methods […] data sets often have multiple files that require merging, change the wording of questions over time, provide incomplete codebooks, and have unclear and sometimes duplicative variables."

  • "Given these issues, I would not be surprised if coding errors were fairly common […]"



Source: http://retractionwatch.com/2015/09/10/divorce-study-felled-by-a-coding-error-gets-a-second-chance/#more-32151

Reproducibility: why should you care?

Think back to every time…

  • The results in Table 1 don't seem to correspond to those in Figure 2.
  • In what order do I run these scripts?
  • Where did we get this data file?
  • Why did I omit those samples?
  • How did I make that figure?
  • "Your script is now giving an error."
  • "The attached is similar to the code we used."



Source: Karl Broman

No collaborators?





Your closest collaborator is you six months ago,
but you don’t reply to emails.

- Mark Holder




Reproducibility: how?

Reproducibility checklist

  • Are the tables and figures reproducible from the code and data?
  • Does the code actually do what you think it does?
  • In addition to what was done, is it clear why it was done? (e.g., how were parameter settings chosen?)
  • Can the code be used for other data?
  • Can you extend the code to do other things?

Ambitious goal + many other concerns

We need an environment where

  • data, analysis, and results are tightly connected, or better yet, inseparable

  • reproducibility is built in
    • the original data remains untouched
    • all data manipulations and analyses are inherently documented
  • documentation is human readable and syntax is minimal
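
One way to approach "the original data remains untouched" is to only ever read from a raw-data directory and to write every derived file somewhere else. The sketch below is an illustration under assumptions: the directory names, the made-up `survey.csv` file, and the helper `clean_and_save()` are all hypothetical, and it works in a temporary directory so it can be run anywhere.

```r
# Hypothetical project layout: raw data in one directory, derived data in another.
project_dir <- file.path(tempdir(), "project-sketch")
raw_dir     <- file.path(project_dir, "raw-data")
derived_dir <- file.path(project_dir, "derived-data")
dir.create(raw_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(derived_dir, recursive = TRUE, showWarnings = FALSE)

# Made-up raw file standing in for real data.
raw_path <- file.path(raw_dir, "survey.csv")
write.csv(data.frame(id = 1:4, age = c(25, NA, 31, 19)), raw_path,
          row.names = FALSE)

# Hypothetical helper: read the raw file, apply the documented steps, and
# write the result to the derived directory. The raw file is only read.
clean_and_save <- function(raw_path, derived_dir) {
  raw <- read.csv(raw_path)
  cleaned <- raw[!is.na(raw$age), ]  # step 1: drop rows with missing age
  out_path <- file.path(derived_dir, "survey-clean.csv")
  write.csv(cleaned, out_path, row.names = FALSE)
  out_path
}

# Checksums before and after show that the raw file was not modified.
raw_checksum_before <- unname(tools::md5sum(raw_path))
clean_path <- clean_and_save(raw_path, derived_dir)
raw_checksum_after <- unname(tools::md5sum(raw_path))
```

Because every manipulation lives in the script, the script itself is the record of what was done to the data.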

Toolkit

Outline

  1. Scriptability \(\rightarrow\) R

  2. Literate programming \(\rightarrow\) R Markdown

  3. Version control \(\rightarrow\) Git / GitHub

1. Scriptability

Point-and-click vs. scripting

  • Learning curve: Point-and-click software (supposedly) has a shallower learning curve than scripting languages

  • Documentation: At a minimum, your code documents your analysis
    • And you can do better with comments and README files
  • Automation: Need to rerun your analysis with new/updated data? Just change the input file.

  • Collaboration: Sharing your analysis is as easy as sharing your scripts
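
The automation point can be sketched as follows. This is a minimal illustration, not a prescribed workflow: the function `summarise_ages()` and the two made-up input files are hypothetical. The idea is that the analysis depends only on the path it is given, so rerunning it on updated data means changing one argument.

```r
# Hypothetical analysis function: everything depends only on the input file.
summarise_ages <- function(path) {
  dat <- read.csv(path)
  c(n = nrow(dat), mean_age = mean(dat$age, na.rm = TRUE))
}

# Two made-up input files standing in for the original and the updated data.
old_file <- tempfile(fileext = ".csv")
new_file <- tempfile(fileext = ".csv")
write.csv(data.frame(age = c(20, 30)), old_file, row.names = FALSE)
write.csv(data.frame(age = c(20, 30, 40)), new_file, row.names = FALSE)

old_result <- summarise_ages(old_file)
new_result <- summarise_ages(new_file)  # rerun: only the input file changed
```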

Why R?

  • Programming language for data analysis
  • Free!
  • Open source
  • Widely used and supported across all disciplines
  • Can be used on Windows, Mac OS X, or Linux
  • Thousands of statistical data analysis packages


Why not language X?

  • There are a number of other great programming tools out there that can also be used to improve the reproducibility of your analysis

  • The key is to use some type of language that will allow you to automate and document your analysis

  • Once you master one language you'll probably find it easier to learn another

Once in R

You could just type into the command prompt, but that doesn't help much with

  • documentation

or

  • automation


2. Literate programming

Donald Knuth, "Literate Programming" (1983)

"Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do."

"The practitioner of literate programming […] strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other."

  • These ideas have been around for years!
  • and tools for putting them to practice have also been around
  • but they have never been as accessible as the current tools

A better solution than just R

With RStudio you can combine your programming and your documentation

  • Gives you a single environment to combine your documentation and your analysis
  • Runs on top of R


What is Markdown?

  • Markdown is a lightweight markup language for creating HTML (or XHTML) documents.

  • Markup languages are designed to produce documents from human readable text (and annotations).

  • Some of you may be familiar with LaTeX, another (less human-friendly) markup language, typically used for creating PDF documents.

  • Why I love Markdown:
    • Simple syntax means easy to learn and use.
    • Focus on content, rather than coding and debugging errors.
    • Allows for easy web authoring.
    • Once you have the basics down, you can get fancy and add HTML, JavaScript, and CSS.

Sample Markdown document


What is R Markdown?

Well, it's R + Markdown:

  • Ease of Markdown syntax

  • Rendering of R code to produce output and plots

Sample R Markdown document


Another R Markdown document





This presentation!





Example: Big Five Personality Test

The Big Five personality traits are five broad dimensions used by some psychologists to describe the human personality and psyche: openness to experience, conscientiousness, extraversion, agreeableness, and neuroticism.



Load data with an R chunk:

library(dplyr) # provides %>% and as_tibble()
big5 <- read.delim("raw-data/big5.txt") %>%
  as_tibble() # for formatting; replaces the deprecated tbl_df()



Sources: Wikipedia and http://personality-testing.info/_rawdata/.

Under the hood


View data

big5
## # A tibble: 19,719 × 57
##     race   age engnat gender  hand source country    E1    E2    E3    E4
##    <int> <int>  <int>  <int> <int>  <int>  <fctr> <int> <int> <int> <int>
## 1      3    53      1      1     1      1      US     4     2     5     2
## 2     13    46      1      2     1      1      US     2     2     3     3
## 3      1    14      2      2     1      1      PK     5     1     1     4
## 4      3    19      2      2     1      1      RO     2     5     2     4
## 5     11    25      2      2     1      2      US     3     1     3     3
## 6     13    31      1      2     1      2      US     1     5     2     4
## 7      5    20      1      2     1      5      US     5     1     5     1
## 8      4    23      2      1     1      2      IN     4     3     5     3
## 9      5    39      1      2     3      4      US     3     1     5     1
## 10     3    18      1      2     1      5      US     1     4     2     5
## # ... with 19,709 more rows, and 46 more variables: E5 <int>, E6 <int>,
## #   E7 <int>, E8 <int>, E9 <int>, E10 <int>, N1 <int>, N2 <int>, N3 <int>,
## #   N4 <int>, N5 <int>, N6 <int>, N7 <int>, N8 <int>, N9 <int>, N10 <int>,
## #   A1 <int>, A2 <int>, A3 <int>, A4 <int>, A5 <int>, A6 <int>, A7 <int>,
## #   A8 <int>, A9 <int>, A10 <int>, C1 <int>, C2 <int>, C3 <int>, C4 <int>,
## #   C5 <int>, C6 <int>, C7 <int>, C8 <int>, C9 <int>, C10 <int>, O1 <int>,
## #   O2 <int>, O3 <int>, O4 <int>, O5 <int>, O6 <int>, O7 <int>, O8 <int>,
## #   O9 <int>, O10 <int>

Clean data

You can include script files in your R Markdown document

source("code/01-data-cleanup.R")
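
The contents of the cleanup script are not shown here, so the following is only a guess at the kind of steps such a script might contain, run on a tiny made-up data frame rather than the real `big5` data. The gender coding (1 = Male, 2 = Female, 3 = Other, 0 = missing) and the scoring rule (a trait score is the plain sum of its items) are assumptions for illustration; the real data may need different recoding, for example reverse-scored items.

```r
# Tiny made-up data frame in the style of big5 (not the real data).
toy <- data.frame(
  gender = c(1, 2, 3, 0),
  E1 = c(4, 2, 5, 3),
  E2 = c(2, 2, 1, 3),
  N1 = c(1, 5, 3, 2),
  N2 = c(2, 4, 3, 1)
)

# Assumed coding: 1 = Male, 2 = Female, 3 = Other, 0 = missing.
toy$gender[toy$gender == 0] <- NA
toy$gender <- factor(toy$gender, levels = c(2, 1, 3),
                     labels = c("Female", "Male", "Other"))

# Assumed scoring: each trait score is the sum of its items.
toy$extraversion <- rowSums(toy[, c("E1", "E2")])
toy$neuroticism  <- rowSums(toy[, c("N1", "N2")])
```

Keeping steps like these in a script that the R Markdown document sources means the recoding is recorded once and rerun every time the document is rendered.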


View distribution of age

ggplot(big5, aes(x = age)) +
  geom_histogram()

summary(big5$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   18.00   22.00   26.26   31.00   99.00

Regress extraversion vs. neuroticism and gender

Extraversion: Seeking fulfillment from sources outside the self or in community. High scorers are social, low scorers prefer to work alone.

Neuroticism: Being emotional.

# Model extraversion as a function of neuroticism, gender, and their interaction
m_ext_neuro <- lm(extraversion ~ neuroticism * gender, data = big5)
summary(m_ext_neuro)
## 
## Call:
## lm(formula = extraversion ~ neuroticism * gender, data = big5)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.3125  -6.3391   0.0132   6.6079  26.0924 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)
## (Intercept)             15.202758   0.190240  79.913  < 2e-16
## neuroticism              0.297346   0.009615  30.925  < 2e-16
## genderMale              -1.893017   0.327308  -5.784 7.42e-09
## genderOther             -5.721794   2.177580  -2.628  0.00861
## neuroticism:genderMale   0.001576   0.015226   0.104  0.91755
## neuroticism:genderOther -0.008332   0.125205  -0.067  0.94694
## 
## Residual standard error: 8.854 on 19605 degrees of freedom
##   (24 observations deleted due to missingness)
## Multiple R-squared:  0.08003,    Adjusted R-squared:  0.0798 
## F-statistic: 341.1 on 5 and 19605 DF,  p-value: < 2.2e-16

Plot extraversion vs. neuroticism and gender

ggplot(data = big5, aes(x = neuroticism, y = extraversion, color = gender)) +
  geom_jitter(alpha = 0.5) + # jittered points only; adding geom_point() too draws every point twice
  geom_smooth(method = "lm")

Suppose you want only teens

big5_teen <- filter(big5, age <= 19)
m_ext_age_teen <- lm(extraversion ~ age * gender, data = big5_teen)
summary(m_ext_age_teen)
## 
## Call:
## lm(formula = extraversion ~ age * gender, data = big5_teen)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.8426  -6.9399   0.0037   7.0601  22.6662 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     14.12536    1.43788   9.824  < 2e-16
## age              0.30091    0.08502   3.539 0.000404
## genderMale       6.78702    2.47559   2.742 0.006131
## genderOther      6.66006   11.01228   0.605 0.545342
## age:genderMale  -0.42066    0.14590  -2.883 0.003949
## age:genderOther -0.76174    0.66364  -1.148 0.251085
## 
## Residual standard error: 9.366 on 6740 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.005666,   Adjusted R-squared:  0.004929 
## F-statistic: 7.681 on 5 and 6740 DF,  p-value: 3.274e-07

Plot for only teens

ggplot(data = big5_teen, aes(x = neuroticism, y = extraversion, color = gender)) +
  geom_jitter(alpha = 0.5) + # jittered points only; adding geom_point() too draws every point twice
  geom_smooth(method = "lm")

3. Version control

What is version control?

Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.

Bad

Good

    2013-10-14_manuscriptFish.doc
    2013-10-30_manuscriptFish.doc
    2013-11-05_manuscriptFish_initialRyanEdits.doc
    2013-11-10_manuscriptFish.doc
    2013-11-11_manuscriptFish.doc
    2013-11-15_manuscriptFish.doc
    2013-11-30_manuscriptFish.doc
    2013-12-01_manuscriptFish.doc
    2013-12-02_manuscriptFish_PNASsubmitted.doc
    2014-01-03_manuscriptFish_PLOSsubmitted.doc
    2014-02-15_manuscriptFish_PLOSrevision.doc
    2014-03-14_manuscriptFish_PLOSpublished.doc

Better - Saving everything together at once

Every time you save, you zip the entire directory that holds your project files and save the archive with a date.

Best - Version Control

How does a version control system work?

  • Start with a base version of the document, then save just the changes you made at each step of the way.

  • Think of it as a tape: if you rewind the tape and start at the base document, then you can play back each change and end up with your latest version.

  • You can "play back" different sets of changes onto the base document and get different versions of the document.
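
The tape analogy can be sketched in a few lines of R. This is a conceptual toy, not a description of how Git stores data internally: the document contents, the list of changes, and the helper `play_back()` are all made up for illustration.

```r
# The base version of a document, one element per line.
base_doc <- c("Title: Fish manuscript", "Methods: TODO")

# Each change is a small function that turns one version into the next.
changes <- list(
  function(doc) replace(doc, 2, "Methods: we counted fish"),
  function(doc) c(doc, "Results: many fish"),
  function(doc) c(doc, "Discussion: fish are great")
)

# Hypothetical helper: replay the first n changes onto the base document.
play_back <- function(base, changes, n = length(changes)) {
  Reduce(function(doc, change) change(doc), changes[seq_len(n)], base)
}

latest  <- play_back(base_doc, changes)         # all changes applied
earlier <- play_back(base_doc, changes, n = 1)  # rewind to after the first change
```

Replaying every change gives the latest version, and replaying only some of them recovers an earlier one, which is what recalling a specific version amounts to.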

Source: Software Carpentry.

Git/GitHub

  • Easy to set up
  • Integrated with RStudio
  • GitHub's strong community: your colleagues are probably already there
  • Provides tools to help enhance collaboration
  • A common location to share your work

Commits

Diff

Parting remarks

Two-pronged approach

Everyone struggles with reproducibility and it is a hindrance to moving science forward.

#1 Adopt a reproducible research workflow



#2 Train new researchers who don’t have any other workflow


Resources

LaTeX

\(\hat{y} = \beta_0 + \beta_1 \times x\)